Ngram Search Engine

Author

  • Satoshi Sekine
Abstract

In this paper, we describe an idea and its implementation for an ngram search engine for very large sets of ngrams. The engine supports queries with an arbitrary number of wildcards. A search takes a fraction of a second and can return the fillers of the wildcards. We implemented the system using two datasets: the 1 billion 5-grams provided by Google (Web 1T data) and a set of 119 million 9-grams created from 82 years of newspapers. The system runs on a single Linux PC with a reasonable amount of memory (less than 4GB) and disk space (less than 400GB). This system can be a very useful tool for knowledge discovery and other NLP tasks.
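To make the wildcard query idea concrete, here is a minimal in-memory sketch in Python. It is not the implementation from the paper, whose index design is not given in this abstract. It assumes a positional inverted index that maps each (position, token) pair to the ids of the ngrams containing it, and it uses `*` as the wildcard symbol. The class name `NgramIndex`, the method `search`, and the sample data are all hypothetical.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed wildcard symbol, not taken from the paper


class NgramIndex:
    """Toy positional inverted index over ngrams (illustrative only)."""

    def __init__(self, ngrams):
        # ngrams: iterable of (tuple_of_tokens, frequency)
        self.ngrams = list(ngrams)
        self.postings = defaultdict(set)
        for ngram_id, (tokens, _freq) in enumerate(self.ngrams):
            for position, token in enumerate(tokens):
                self.postings[(position, token)].add(ngram_id)

    def search(self, query):
        """Return (ngram, frequency, fillers) per match, most frequent first."""
        pattern = query.split()
        fixed = [(i, t) for i, t in enumerate(pattern) if t != WILDCARD]
        wild_positions = [i for i, t in enumerate(pattern) if t == WILDCARD]
        if fixed:
            # Intersect the posting sets of all fixed tokens, smallest first.
            sets = sorted((self.postings.get(key, set()) for key in fixed), key=len)
            candidates = set.intersection(*sets)
        else:
            # A query made only of wildcards matches every ngram of that length.
            candidates = set(range(len(self.ngrams)))
        results = []
        for ngram_id in candidates:
            tokens, freq = self.ngrams[ngram_id]
            if len(tokens) != len(pattern):
                continue
            # The fillers are the tokens found at the wildcard positions.
            fillers = tuple(tokens[i] for i in wild_positions)
            results.append((tokens, freq, fillers))
        results.sort(key=lambda r: (-r[1], r[0]))
        return results


# Hypothetical sample data: 3-grams with made-up frequencies.
sample = [
    (("is", "a", "city"), 120),
    (("is", "a", "country"), 80),
    (("was", "a", "city"), 40),
    (("is", "the", "city"), 60),
]
index = NgramIndex(sample)
for tokens, freq, fillers in index.search("is * city"):
    print(" ".join(tokens), freq, fillers)
```

An in-memory index like this would not scale to a billion 5-grams. A system of the size described above would need a disk-backed index, which this sketch does not attempt to model.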


Similar resources

A Proposal for Enhancement of Elasticsearch by mitigating n-Gram Indexing

Searching is one of the most important activities on the Internet. Whenever one looks for any information on the World-Wide Web (WWW), the very first activity performed is searching. As the amount of data on the World-Wide Web (WWW) is increasing at a very fast rate, it is becoming very difficult to derive useful information from it. It allows every ordinary user to publish data that can be r...


Peachnote: Music Score Search and Analysis Platform

Hundreds of thousands of music scores are being digitized by libraries all over the world. In contrast to books, they generally remain inaccessible for content-based retrieval and algorithmic analysis. There is no analogue to Google Books for music scores, and there exist no large corpora of symbolic music data that would empower musicology in the way large text corpora are empowering computati...


Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). It outputs the matched ngrams with their frequencies as well as all the context...


Introducing Linggle: From Concordance to Linguistic Search Engine

We introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. Unlike a typical concordance, Linggle accepts queries with keywords, wildcards, wild parts of speech (PoS), synonymous words, and additional regular expression (RE) operators, and returns bundles with frequency counts. In our approach, we augment the Google Web 1T corpus with inv...


Linggle: a Web-scale Linguistic Search Engine for Words in Context

In this paper, we introduce a Web-scale linguistics search engine, Linggle, that retrieves lexical bundles in response to a given query. The query might contain keywords, wildcards, wild parts of speech (PoS), synonyms, and additional regular expression (RE) operators. In our approach, we incorporate inverted file indexing, PoS information from BNC, and semantic indexing based on Latent Dirichl...



Publication date: 2008